<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Scaling Out with Celery</h1>
<p><code class="docutils literal notranslate"><span class="pre">CeleryExecutor</span></code> is one of the ways you can scale out the number of workers. For this
to work, you need to setup a Celery backend (<strong>RabbitMQ</strong>, <strong>Redis</strong>, &#x2026;) and
change your <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code> to point the executor parameter to
<code class="docutils literal notranslate"><span class="pre">CeleryExecutor</span></code> and provide the related Celery settings.</p>
<p>For more information about setting up a Celery broker, refer to the
exhaustive <a class="reference external" href="http://docs.celeryproject.org/en/latest/getting-started/brokers/index.html">Celery documentation on the topic</a>.</p>
<p>Here are a few imperative requirements for your workers:</p>
<ul class="simple">
<li><code class="docutils literal notranslate"><span class="pre">airflow</span></code> needs to be installed, and the CLI needs to be in the path</li>
<li>Airflow configuration settings should be homogeneous across the cluster</li>
<li>Operators that are executed on the worker need to have their dependencies
met in that context. For example, if you use the <code class="docutils literal notranslate"><span class="pre">HiveOperator</span></code>,
the hive CLI needs to be installed on that box, or if you use the
<code class="docutils literal notranslate"><span class="pre">MySqlOperator</span></code>, the required Python library needs to be available in
the <code class="docutils literal notranslate"><span class="pre">PYTHONPATH</span></code> somehow</li>
<li>The worker needs to have access to its <code class="docutils literal notranslate"><span class="pre">DAGS_FOLDER</span></code>, and you need to
synchronize the filesystems by your own means. A common setup would be to
store your DAGS_FOLDER in a Git repository and sync it across machines using
Chef, Puppet, Ansible, or whatever you use to configure machines in your
environment. If all your boxes have a common mount point, having your
pipelines files shared there should work as well</li>
</ul>
<p>To kick off a worker, you need to setup Airflow and kick off the worker
subcommand</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>airflow worker
</pre>
</div>
</div>
<p>Your worker should start picking up tasks as soon as they get fired in
its direction.</p>
<p>Note that you can also run &#x201C;Celery Flower&#x201D;, a web UI built on top of Celery,
to monitor your workers. You can use the shortcut command <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">flower</span></code>
to start a Flower web server.</p>
<p>Some caveats:</p>
<ul class="simple">
<li>Make sure to use a database backed result backend</li>
<li>Make sure to set a visibility timeout in [celery_broker_transport_options] that exceeds the ETA of your longest running task</li>
<li>Tasks can and consume resources, make sure your worker as enough resources to run <cite>worker_concurrency</cite> tasks</li>
</ul>
</body>
</html>